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Abstract 

Reformulating computer vision problems over Riemannian manifolds has demonstrated superior 
performance in various computer vision applications. This is because visual data often forms a 
special structure lying on a lower dimensional space embedded in a higher dimensional space. 
However, since these manifolds belong to non-Euclidean topological spaces, exploiting their struc¬ 
tures is computationally expensive, especially when one considers the clustering analysis of massive 
amounts of data. To this end, we propose an efficient framework to address the clustering problem 
on Riemannian manifolds. This framework implements random projections for manifold points 
via kernel space, which can preserve the geometric structure of the original space, but is computa¬ 
tionally efficient. Here, we introduce three methods that follow our framework. We then validate 
our framework on several computer vision applications by comparing against popular clustering 
methods on Riemannian manifolds. Experimental results demonstrate that our framework main¬ 
tains the performance of the clustering whilst massively reducing computational complexity by 
over two orders of magnitude in some cases. 
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1. Introduction 

Clustering analysis is an automated process that groups unlabelled data into subsets (here 
called clusters) that may express the underlying structure of the data. It is one of the most 
critical tools for understanding visual data [1, 2]. For instance, significant amounts of visual data 
such as videos and pictures are uploaded every second [3]. Indeed, this is the case for YouTube 
where 100 hours of video are uploaded every minute [4]. Although these videos have titles and 
some additional meta-information, it is often desirable to automatically group the videos in terms 
of specific criteria such as visual similarity or detected objects. 

In recent years, modelling visual data in analytical manifolds such as Riemannian manifolds 
has enjoyed success in various computer vision application domains such as face recognition [5], 
action recognition [6] and pedestrian detection [7]. This is because visual features and models often 
possess special structures which Euclidean space fails to capture. Riemannian manifolds which 
form curved spaces are a more appropriate approach to model problems in various computer vision 
tasks. 




Unfortunately, despite the fact that clustering methods have been studied since the 1950s [1, 8], 
applying such methods directly on data represented on Riemannian manifolds is not trivial. Rie- 
mannian manifolds generally do not conform to Euclidean space [5, 9]. To address this, one could 
use manifold tangent spaces which are locally homeomorphic to Euclidean space [9]. However, 
this brings another challenge to applying existing clustering algorithms as some general algebraic 
operations are not well dehned [10]. For instance, K-means requires the computation of the mean 
within a cluster which cannot be computed directly. To this end, Pennec et ah [10] reformulated 
the computation of mean as a solution to an optimisation problem. Using this formulation, the 
mean point is considered as the point over the manifold minimising the geodesic distance (i.e. the 
true distance on the manifold between two points) from the mean point to all other points. The al¬ 
gorithm to solve this problem is called Karcher mean [10]. Thanks to the Karcher mean, Turaga et 
al. [5] extended the K-means algorithm into the Riemannian manifold, which is regarded as in¬ 
trinsic K-means and has been applied to activity-based video clustering. Intrinsic K-means has 
further demonstrated better performance than Euclidean-based methods (for example. Protein 
Clustering [11]). 

Generally, methods that completely honour the manifold topology lead to higher accuracy. 
We shall categorise these methods as intrinsic methods. Unfortunately, the computational cost 
of intrinsic methods is extremely high since these need to map all of the data to tangent spaces 
repeatedly. 

Extrinsic methods, on the other hand, seek solutions that may not completely consider the 
manifold topology [12, 13, 14, 15, 16, 17]. The most simplistic way, here called Log Euclidean 
methods, is to embed all of the points into a designated tangent space at the identity point [18]. Log 
Euclidean methods can be considered as flattening the manifold space. It has been used in various 
computer vision applications, such as human action recognition [13] and cell classihcation [14]. 
This addresses the computational cost issues suffered by the intrinsic methods, as the tangent 
space is homeomorphic to the Euclidean space and well-known Euclidean clustering approaches 
such as K-means can be directly applied. Unfortunately, as the flattening step distorts the pair¬ 
wise distances in regions far from the origin of the tangent space, accuracy is severely compromised. 
So much of the value of the manifold approach is lost. 

Other approaches that fall in the extrinsic method category are kernel-based approaches [15, 
16, 17], such as Kernel K-means. In essence, the data in manifold space are hrst embedded into the 
Reproducing Kernel Hilbert Space (RKHS) [19]. As the embedding function is dehned implicitly, 
generally kernel-based approaches make use of the inner products in the RKHS in their formulation. 
These inner products are then arranged in a Gram matrix. It is often observed that the right 
choice of kernel could signihcantly improve the performance [15]. Furthermore, in general, kernel 
inner products with specihed metrics have much less computational complexity than geodesic 
distances [20, 21]. With these properties, kernel-based approaches could be suitable to address 
issues suffered in both the intrinsic approach and the Log Euclidean approach. Unfortunately, the 
kernel-based approaches cannot scale easily, as the Gram matrix computation is O(n^) where n is 
the number of data points. Also, it is often quite challenging to kernelise the existing algorithms 
that do not have known kernelised versions [22]. Furthermore, Nikhil et al. demonstrate that 
clustering data in the RKHS may lead to unexpected results since the clusters obtained in the 
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Table 1: Summary of the existing works compared to our proposal. 


Approach 

Exploits Manifold Structure 

Accuracy 

Gomputational 

Gomplexity 

Intrinsic Methods [5, 11] 

Yes 

High 

High 

Log-Euclidean Methods [12, 13, 14] 

Minimal 

Low 

Low 

Kernel Methods [15, 16, 17] 

Approximately 

High 

Moderate 

Our proposal 

Approximately 

High 

Low 


RKHS may not exhibit the structure of the original data[23]. 

Contributions We summarise the advantages and shortcomings of the existing approaches in 
Table 1. Our goal is to develop an efficient clustering algorithm for Riemannian manifolds, which 
signihcantly reduces the computational complexity, but still maintains acceptable performance. 
The inspirations are drawn from the random projection for Euclidean spaces which has enjoyed 
success in various domains [24, 25, 26] due to its simplicity and theoretical guarantees [27]. We 
list our contributions as follows: 

1. We propose a random projection framework for manifold features. In general, the term 
projection is not well dehned in Riemannian manifolds. Therefore, we address this via the 
RKHS constructed from a small subset of data. Once projected, we choose to apply the 
K-means algorithm. 

2. From our framework, it becomes clear that random hyperplane generation is essential. Thus, 
we describe three generation algorithms which are followed in our framework: (1) Kernelised 
Gaussian Random Projection (KGRP); (2) Kernelised Orthonormal Random Projection 
(KORP) and (3) Kernel Principal Gomponent Analysis Random Projection (KPGA-RP). 

We note that our method is different from manifold learning approaches for clustering anal¬ 
ysis described in [28]. Manifold learning is the collection of non-linear dimensionality reduction 
(NLDR) techniques that seek for a low dimensional representation of a set of high-dimensional 
points lying on a non-linear manifold [29]. They assume the structure of the underlying manifold 
was unknown. Gontrary to this, in our paper, we are interested in Riemannian manifolds whose 
underlying geometry is known. 

We continue the paper as follows. Section 2 provides a brief mathematical background of 
Riemannian manifolds. Section 3 details the proposed random projection framework for manifold 
points and develops three different random projection methods for clustering points on manifold 
spaces. The proposed methods are then contrasted with the state-of-the-art methods in Section 4. 
The conclusions and future directions are summarised in Section 5. 

2. The Geometry of Riemannian Manifolds 

A differentiable manifold A4 is a topological space that is locally similar to Euclidean space [30] . 
One can use the tangent space to model the neighbourhood structure on a differentiable manifold. 
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The tangent space at a point X on the manifold, TxXi, is a vector space that contains all possible 
directions tangentially passing through X [30]. 

A Riemannian manifold is a differentiable manifold, endowed with a Riemannian metric. The 
Riemannian metric is the family of inner products on all of the tangent spaces [31]. This metric 
enables us to dehne geometric concepts such as lengths, angles and distances. The geodesic 
distance between two points is dehned as the length of the shortest curve between X and 

Y [31], 

In this section, we briefly introduce two well known Riemannian manifolds used in the com¬ 
puter vision community, namely Symmetric Positive Definite (SPD) manifold and Grassmannian 
manifold. 

2.1. SPD Manifolds 

To compute a compact representation of an image, one method is to calculate the covariance 
matrix of a set of d-dimensional vector features extracted from the image [32]. Covariance ma¬ 
trices naturally arise in the form of SPD matrices, which can be considered as points on SPD 
manifolds [7]. The geodesic distance between points on SPD manifolds then can be calculated 
through an affine invariant Riemannian metric: 

dist(X,r) = ||log(X-5l-X-5)|||, (1) 

where X,Y G Xi are two points over the SPD manifold. For further discussions on SPD mani¬ 
folds, the readers are referred to [9]. 

To further improve clustering performance, SPD manifolds could be projected into RKHS by 
Mercer kernels. In this paper, we use one of the popular kernels for SPD manifolds, namely the 
Gaussian kernel, which is dehned by: 


K(X, Y) = exp(-/? ■ dist(X, Y)) , (2) 

where dist(X, Y) is the geodesic distance between point X and Y from Eqn.l. Since the geodesic 
distance is computationally demanding, several methods for computing the approximate distance 
have been developed [18, 33, 34]. In this paper, we use two popular approximate distance functions: 
Log Euclidean Distance (LED) [18] and Stein Divergence (SD) [33]. The Gaussian kernel with 
LED and SD then can be respectively formulated by: 


Kled{X,Y) = exp(-/? ■ II log(X) - log(r)|||) 


( 3 ) 


and 


Y) = exp(-/? ■ log det 


X + Y 


--log (det (XI-))) 


( 4 ) 


Note that, in order to become a Mercer kernel, the Gaussian kernel with SD requires (3 to be of 
the form: /d G {4, |,..., 
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2.2. Grassmannian Manifolds 

The Grassmannian Manifold Qq,d, is the set of all d-dimensional subspaces of A point 
on the Grassmann manifold, X G Qq,d, can be denoted by an orthonormal matrix in The 

geodesic distance between points X and on a Grassmannian manifold is dehned as: 

dist(x.y) = y'9? + ... + ep (5) 

where 0* is the principal angle between X and Y. The angle 6i can be calculated by 0* = 
wherein fi are singular values of X^Y. We refer readers to [35] for further treatment 
on Grassmannian manifolds. A popular kernel used over Grassmannian manifolds is known as the 
Projection kernel [36, 37], which can be formulated as: 

K{X,Y) = P-\\X^Y\\l . (6) 


3. Proposed Framework 

As mentioned in Section 1, the goal of our work is to signihcantly reduce clustering computa¬ 
tional complexity for manifold features while maintaining reasonable clustering performance. We 
address this by adopting a random projection approach to Riemannian manifolds. In this section, 
we hrst discuss the overview of random projection in Euclidean space. We then extend the notion 
into the Riemannian manifold space. 

3.1. Random Projection in Euclidean Space 

In Euclidean space, the random projection embeds original data into a much lower dimensional 
space whilst preserving the geometric structure [38]. This can signihcantly reduce the computa¬ 
tional complexity of learning algorithms, such as classihcation or clustering. For instance, as a 
result, this is used to achieve real time performance in object tracking [39]. 

A point a; G in Euclidean space can be projected into a random k-dimensional subspace 
{k « d) via a set of randomly generated hyperplanes where r* G This can be 

formulated as: 


/(x) = x^R , (7) 

where R is the random matrix that arranges the random hyperplanes as column vectors. Note 
that in order to minimise distortions produced by the projection, the matrix R should possess a 
particular property. We introduce this property in Dehnition 3.2. When the random projection 
matrix R possesses such a property, then the Johnson-Lindenstrauss Lemma (JL-Lemma) [40] 
applies. 

Lemma 3.1. [Johnson-Lindenstrauss Lemma [jO]] For any e such that e > 0, and any set of 
points X with \X\ = n upon projection to a uniform random k-dimension subspace where k > 
0(e“^log n), the following property holds for every pair u,v & X , {1 — e)\\u — u|p < ||f(w) — 
< (1 + ~ where i{u),i{v) are the projections of u,v . 
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Remarks The JL-Lemma principally states that a set of high dimensional points can be embedded 
using a set of uniform random hyperplanes into lower dimensional space wherein the pairwise 
distance between two points is well preserved (with high probability). The original proof of JL- 
Lemma uses quite challenging geometric approximation machinery [40]. Frankl and Meahara [41] 
simplihed that proof by considering a projection into k random orthonormal vectors. Recently 
there have been several properties of the random matrix where JL-Lemma still applies. We shall 
call the type of projection wherein the random matrix has properties that allow the JL-Lemma to 
be applied as a JL-Type projection. 

Definition 3.2 (JL-Type projection). Let R = [t"! ■ ■ ■ Tfc], ^ be a random matrix whose 
columns are the random hyperplanes. The projection f('u) = R7u,u G ]R'^,f('u) G is called 
JL-Type projection when the matrix R possesses at least one of the following properties: 

1. The columns of R are orthogonal unit-length vectors [jl]; 

2. Each element in R is selected independently from a standard Gaussian distribution N{0, 1) 
or uniform distribution U{—1, 1) [42]; 

3. R is a sparse matrix whose elements belong to { — 1, 0, -1-1} with probability {1/6, 2/3,1/6} [jS]. 

We note that Property 1 in Dehnition 3.2 considers columns of the random matrix R as the basis 
of a random space, thus they are required to be pairwise orthogonal [41]. To this end, one needs 
to apply an orthogonalisation technique such as the Gram-Schmidt method [44] on R. Arriaga et 
al. [42] proved that it suffices to use random non-orthonormal matrices with independent elements 
chosen from some distributions which are listed in Property 2 of Dehnition 3.2. Recently, Li et 
al. [43] proposed a sparse random projection matrix presented in Property 3 of Dehnition 3.2. 
The sparse random projection achieves a further threefold speed-up as only 1/3 of the matrix 
have non-zero elements. 

We note that the random projection is not data driven. It means that it does not need 
a set of labelled training data, making it suitable for unsupervised learning scenarios such as 
clustering [45, 46]. 

3.2. Random Projection in Riemannian Manifolds via RKHS 

As mentioned in Section 1, applying the random projection on points residing in the Rieman¬ 
nian manifold space is not trivial, due to the notion of projection itself being generally not well 
dehned. We approach this problem by reformulating the problem in the RKHS. Recall that, the 
random matrix containing column vector of hyperplanes r* should be generated from a particular 
process. Thus, the projection of each individual dimension into the projected space is carried out 
as follows: 


fi{x) = x^Vi , (8) 

where /*(■) is the Ath dimension of the projected vector x. 

In the RKHS, the above formulation can be rewritten as: 

fi{x) = ([{xYri , (9) 
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Figure 1: The illustration of our proposed framework. We first generate the hyperplanes in RKHS. 
Each point in the manifold space is then mapped into the projected space via the kernel inner 
product. Finally we apply K-means in the projected space. 

where 0(-) is the function that embeds the input space into the RKHS. Note that, in this case, 
the hyperplane r* is now defined in the RKHS, Vi G 'H. The projection in the RKHS can be 
considered as the inner product which is defined as the kernel similarity function. 

Eqn. 9 provides insight that the JL-Type projection could be achieved as long as one could 
generate the hyperplanes that follow one of the above properties in Definition 3.2 in the RKHS. 
In similar fashion, when the data point x is replaced by a point X in manifold X G Al, then one 
could use Eqn. 9 as the framework to achieve JL-Type projection in the manifold space. As such, 
we propose a framework for clustering manifold points, which is briefly illustrated in Figure 1. 
This hyperplane generation is the central idea in our work. First, we generate the hyperplanes 
over the RKHS. The points over the manifold space are then projected into the projected space 
by using the specified kernel similarity function, such as the Gaussian kernel or projection kernel. 
Once the manifold points have been embedded into the projected space, we apply the general 
K-means algorithm to perform clustering. 

In this paper, we explore three hyperplane generation methods for manifold points: (1) KGRP; 
(2) KORP and (3) KPGA-RP. The diagram of our proposed generation methods is illustrated in 
Figure 2. Briefly speaking, the hyperplanes are generated using a randomly selected subset from 
the entire dataset. The projection made by the hyperplanes will follow one of the properties in 
Definition 3.2. We will elaborate on the generation process and theoretical analysis in the following 
section. 

3.2.1. Kernelised Gaussian Random Projection (KGRP) 

In the KGRP method, the hyperplanes are generated from the standard Gaussian distribution 
Ar(0, 1). Each hyperplane G H is assumed to be spanned by a group of data points randomly 
selected. To this end, first a subset S containing p points {0(Xi),..., ^(Xp)} is randomly chosen 
from the entire dataset, 0(Xj) is the representation of manifold points Xi in the RKHS. Each 
data point 0(Xj) from the subset is considered as a vector generated from a particular distribution 
D with unknown mean pL and unknown covariance S. Thanks to the Gentral Limit Theorem 
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Figure 2: The diagram of our proposed generation methods: KORP, KGRP and KPCA-RP. 



















(CLT) [47], one can still produce standard Gaussian distribution data points from these data. 
More precisely, the CLT states that when the number of data points grows larger, the difference 
between the population mean and the sample mean approximates the normal distribution A/'(0,S). 
As such, we hrst randomly select t, t < p, data points from S and let these points be the set 
5i C S. Let Zt = j sample mean over 5i. By applying the CLT and the 

Whitening transform [48], the vector r* = — i-i) can be considered as the point generated 

from a standard Gaussian distribution; thus Uj could be used as a random projection hyperplane. 
Therefore, we denote our embedding function that projects data points in the RKHS to the random 
projection space by: 


/(0(X,)) = . (10) 

The mean is implicitly estimated as ^ = 1 'YTi=i the covariance matrix S is also formed 

over the p data points. In order to compute Eqn. 10, one could use a similar approach to that of 
Kernel Principal Component Analysis (KPCA) [49]. Specifically, let the Eigen-decomposition of 
the covariance matrix S and the kernel matrix over p data points Ks, be VXV^ and U@U^ 
respectively. Based on the fact that the non-zero eigenvalues of V are equal to the non-zero 
eigenvalues of ©, Kulis-Grauman [50] proved that Eqn. 10 is the same as: 

YJ ( 11 ) 

where 

= ( 12 ) 

^ i=i «e5i 

Note that iSi is the set of t points which are randomly selected from S. Further, defining e as a 
vector of all ones, and as a zero vector with ones in the entries corresponding to the indices of 
iSi, the expression in Eqn. 12 can be further simplihed to: 

(13) 

We note that the above formulation was first described for developing the kernelise locality sen¬ 
sitive hashing method in Euclidean scenarios [50]. We then adapted the method in our previous 
work [20] to perform random projection on SPD manifolds for classihcation purposes. Here we 
apply the method for clustering on Riemannian manifold problems. The pseudo code for KGRP 
is summarised in Algorithm 1. 

We note that the total computational complexity of the KGRP algorithm is 0{np + p^ + np^ -|- 
inmp). Specifically, there are four factors contributing to the computational complexity: 

1. Gomputing the kernel Gram matrix Kn^p between n points and p selected points which 
requires 0{np) operations {p « n); 

_ \ l‘2 

2. Generating the random hyperplanes, necessitates calculation of the kernel matrix Kg ' for 
the p points in S which requires 0{p^) operations; 

3. Projecting all of the data points into the random projection space which requires 0{np^) 
operations; 


9 




Algorithm 1 Kernelised Gaussian Random Projection (KGRP) 

Input: the entire dataset: a set of manifold-valued data points Aj G A4; the size of S : 

p; the desired projected space dimensionality : b 
Output: Xi G the data points in the projected space 

1: Randomly select p points from the entire dataset 

2: Gompute the Kernel Gram matrix Kg over points = (j){X,y 0(A,), VA,, VA, G 

let S = {(j){X denote the representations for these p points in the RKHS 
3: Gompute the projection matrix W = {lOi, w^}, \/Wi G W 
4: for i = 1 — 6 do 

5: ^ Randomly select t data points from S 

6: es = [Ai,Ap] if 0(Aj) G 5i, Aj = 1; otherwise Aj = 0 

7 : Wi = ^K'Jes 

8 : end for 

9: Project each point A* into the random projection space: Xi = KW, where K is the Gram 
matrix between Aj and the points 


4. Applying K-means to get the clustering results which requires 0{inmp) operations {£ is the 
number of iterations of K-means, m is the number of clusters and b is the dimension of the 
projected space). 

3.2.2. Kernelised Orthonormal Random Projection (KORP) 

In the second method, we generate orthonormal random hyperplanes (i.e. the hrst property). 
We hrst present the following Lemma that relates the JL-Lemma to the margin of the linear 
hyperplane in supervised learning settings [51]. 

Lemma 3.3. Consider any distribution over labelled examples in Euclidean space such that there 
exists a linear separator w'^ ■ x = with margin A. If we draw d > | [^ In |] examples Zi, • • ■ , Zd 
iid from this distribution, with probability > 1 — d, there exists a vector w' in span (zi, • • • , z^) 
that has error at most e at margin | [51]. 

Proof. We refer the readers to [51] for the proof of this Lemma. 

Remarks. Lemma 3.3 essentially states that, with a high probability, the margin is still well 
preserved (with error at most e) when the hyperplane w' is selected from the space spanned by 
a subset of the data points. Note that, as suggested in [52], when the margin is well preserved, 
then the angle and distance between points are also well preserved. 

This Lemma can also be applied for cases where the data points are in the RKHS. This is 
because the RKHS is essentially an inhnite-dimensional Euclidean space [51]. Given a set of 
points which are linearly separable with margin A under a particular kernel function, we draw d 
random examples Xi,--- , from the same distribution. Then, according to Lemma 3.3, with 
probability > 1 — d, there exists a separator in RKHS in' G "H and w' = aicf^Xi) -|- ■ ■ ■ -f adcfixd) 
with error rate at most e. Note that as ■ ([{x) = ai K{x, Xi) + ... + ad K{x, Xd), we then can 
simply consider the vector of [K(a:, Xi) ■ ■ ■ K{x, Xd)] as the feature representation of x in the space 
spanned by In other words, the K(a:,a:j) is considered as the i-th feature of x. We 

can further formalise this observation with the following Gorollary [51]. 
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Corollary 3.4. If distribution P has margin X in the RKHS, then with probability > 1 — <5, 
if Xi,--- ,Xfi are drawn from the same distribution, for d = f In , the mapping Fi(a;) = 

[K(a;, a^i) ■ ■ ■ K(a;, ajd)] produces a distribution Fi(P) on labelled examples in that is linearly 
separable with error at most e [51]. 

Remarks. The above Corollary suggests the following points: (1) one could generate random 
projection hyperplanes by randomly selecting a subset of data points in RKHS and then projecting 
a point into this space by using Fi(a;); (2) this projection is a JL-Type projection. 

In light of these facts, for our case, we randomly select p points, here denote S = {0(Xi), • • • , 0(Xp)} 
as the implicit representations of the p points in RKHS. However, as it is possible that some hy¬ 
perplanes are not linearly independent, then the hyperplanes could be highly correlated. To that 
end, one needs to orthogonalise the hyperplane set S [51]. In this work, we apply QR decom¬ 
position [44] to construct a set of orthonormal basis from the original basis spanning the same 
subspace. Let us arrange the original basis into a matrix A. Then the matrix A can 

be decomposed into Q and R as follows: 

A = [ct>{X,),--- ,ct>{X,)] = QR, (14) 

where Q is the orthonormal basis and R is the upper triangular matrix. Assuming that we have 
the orthonormal basis Q, then we can observe the following when a data point ([{X) is projected 
into the orthonormal basis Q\ 


= <t>(XyQRR 

= MXVl<l>(X,4 ,(x„)]r' 

~ -1 (15) 

= [<P{Xy<P{X,),...,<p{Xy<P{Xp)]R 

= [KiX,X,),...,KiX,Xp)]R\ 

In other words, one only needs to determine the upper triangular R in order to do the projection. 
We note that as the original basis are in the RKHS then it is not trivial to apply the 

QR decomposition to matrix A. To that end, we hrst multiply the matrix A by its transpose. 
By doing this, we will get the kernel matrix Ks, where Ks{i,j) = ([{Xi)^ ([{Xj), \/([{Xf) and 
V0(Xj) G 5. Thus: 


Ks = A^A 

= (QR^QR 

~T r ~ ( 16 ) 

= R Q^QR 

= RR . 

We can employ the Cholesky Factorisation [44] on the kernel matrix Rs, in order to compute the 
upper triangular R. Algorithm 2 outlines the algorithm for the proposed Kernelised Orthonormal 
Random Projection (KORP). 

The computational complexity of KORP depends on the following steps: 
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Algorithm 2 Kernelised Orthonormal Random Projection (KORP) 

Input: the entire dataset: a set of manifold-valned data points Xj G Ai; the desired 

projected space dimensionality : p 
Output: Xi G W the data points in the projected space 

1: Randomly select p points from the entire dataset 

2: Compnte the kernel Gram matrix over points = ^(X,)^ </,(X,), VX,, VX, G 

3 : Apply Cholesky Factorisation to the kernel matrix Ks = RR 

4 : Project each point Xj into the random projection space: Xi = KR , where K is the Gram 
matrix between Xj and the points 


1. Gompnting the kernel Gram matrix between the entire dataset and the subset S which 
requires 0{np) operations; 

2. Applying Gholesky Factorisation on the kernel Gram matrix of the p points in S which 
requires 0{p^) operations; 

3. Applying the matrix inverse of the right triangular matrix R which demands 0{p^) opera¬ 
tions; 

4. Projecting all of the data points into the orthonormal space with 0{np^) operations; 

5. Applying K-means to get the clustering results which demands 0{inmp) operations is the 
number of iterations of K-means, m is the number of clusters). 

Hence, the total computational complexity is 0{np + p^ + np^ -f inmp). 

3.3. KPCA-based Random Projection (KPCA-RP) 

Inspired by the previous method, one can derive orthonormal projections using the Kernel PGA 
(KPGA). More precisely, after generating random projection hyperplanes by randomly selecting 
the subset S, one can obtain the principal components of the data points in S by applying the 
KPGA. The principal components of S are then considered as the set of orthogonal random 
projection hyperplanes. Finally, following Eqn. 9, the entire data points can be projected into the 
random projection space using the hyperplanes. 

Let us suppose C is the covariance matrix of the points in S which have been centred: 

C ^ if2<KXt}4>(XiP. (17) 

To apply KPGA, one needs to solve the generalised eigen-decomposition problem: 

tV = CV . (18) 

Following the same argument as KPGA [49], the eigenvectors of the covariance matrix C lie in the 
span of 0(Xi), ^(Xa),.., 0(Xp): 
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( 19 ) 


P 

Vk = J2a^<P{X,) , 

i=l 

where the set {a^Yi=i can be determined by solving the following eqnation: 

PTOL = KsOL , ( 20 ) 

where ex = [ck^ ■ ■ ■ ex^] is a matrix wherein each colnmn represents the vector = [o.\ ■ ■ ■ 
whose elements are the linear combination coefficients presented in Eqn. 19 and Ks is the kernel 
matrix of the set S. Note that the above eqnation snggests that the vector cx^ is one of the 
eigenvectors oi Ks- 

Let be the set of principal components extracted from Eqn. 18. To project a point 

into the principal component we perform: 

p 

■ V, = af0(X)T0(X,) . (21) 

i=l 

In the following, we present a theorem that gnarantees that projections into the principal compo¬ 
nents of the snbset S achieves JL-Type projection. 

Theorem 3.5. If a set of points can be separated by a margin X in the RKHS, then with probability 
>1-5, ifS = {cf{X^),...,cf{Xp)], X, G M, are drawn from the same distribution 

for p = I In I], the mapping F 2 (a:) = Fi(a:)[Q:^ ■ ■ ■ cx^], where ex^ is the k-th eigenvector of Ks, 
achieves JL-Type projection with error at most e. 

Proof. As presented in Corollary 3.4, Fi(x) is the fnnetion that maps a point into a random 
projection space wherein the set of hyperplanes S is randomly selected from a set of given points. 
It is known that principal components of S represent the orthonormal bases spanning the snb- 
space spanned by S. Henceforth, compnting the principal components of S can be considered as 
orthogonalisation of the hyperplanes. 

Remarks. The above theorem states that applying KPCA on S means orthogonalising the hy¬ 
perplanes in S. Therefore, the difference between KPCA-RP and KORP is related to how the 
hyperplanes are orthogonalised. We present the KPCA-RP psendo code in Algorithm 3. 


Algorithm 3 KPCA-based Random Projection (KPCA-RP) 

Input: the entire dataset: a set of manifold-valned data points X* G Ad; the desired 

projected space dimensionality : p 
Output: {xi\Yi tbe data points in the projected space 
1: Randomly select p points {Xi\Yi from the entire dataset 

2: Compnte the kernel Gram matrix Ks over points {X^U = (t^{X,y 0(X,), VX,, VX,- G 
3: Apply KPCA to kernel matrix Ks to obtain the eigenvectors a. 

4: Project each point Xj into the random projection space: Xi = Ka, where K is the Gram 
matrix between Xj and the 
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In terms of calculating the computational complexity of the KPCA-RP algorithm, one needs 
to consider four factors: 

1. Computing the kernel Gram matrix between the entire dataset and the subset S, which 
requires 0{np) operations; 

2. Applying KPCA on the kernel Gram matrix of subset S, which requires 0{p^) operations; 

3. Projecting all of the data points into the orthonormal space, which requires 0{np^) opera¬ 
tions; 

4. Applying K-means to get the clustering results, which requires 0{inmp) operations {i is the 
number of iterations of K-means, m is the number of clusters). 

Hence, the total computational complexity is 0{np + p^ + np^ + inmp). 

4. Experimental Results 

We evaluate our proposal using six benchmark datasets: (1) Ballet dataset [53]; (2) UCSD traf- 
£c dataset [54]; (3) UCFIOI Human actions dataset [55]; (4) Brodatz texture dataset [56]; (5) KTH- 
TIPS2b material dataset [57] and (6) HEp-2 Cell ICIP2013 dataset [58]. 

In our evaluation, we consider each video of the hrst three datasets (i.e. Ballet, UCSD and 
UCFIOI) as an image set which can be effectively modelled as a point on Grassmannian manifolds. 
In addition, we use SPD manifold to model images of the latter three datasets (i.e. Brodatz, KTH- 
TIPS2b and HEp-2 Cell ICIP2013). To demonstrate the efficacy of our framework, we report the 
clustering performance and the run time. 

4.1. Datasets and Feature Extraction 

Ballet action dataset (Ballet) [53] - The Ballet dataset presents sequences of videos of ballet 
actions. More precisely, it comprises 44 sequences with 8 different actions: R-L presenting, L- 
R presenting. Presenting, Jump & swing. Jump, Turn, Step, and Stand still (see Figure 3a for 
examples). These ballet actions were performed by two men and one woman, resulting in significant 
intra-class variations such as speed, clothing and movements. In this evaluation, each video is 
considered as an image set. We then represent each image set as a point in the Grassmannian 
manifold. To that end, all the videos are down sampled to 16 x 16 pixels. A Grassmannian point is 
extracted for every 6 consecutive frames. Technically, we hrst vectorise each frame into a column 
vector and arrange them into a 256 x 6 tall matrix (i.e. 256 = 16 x 16). The matrix can be 
considered as a subspace and the orthonormal bases spanning the subspace can be determined by 
applying the Singular Value Decomposition (SVD). The set of orthonormal bases is considered as 
a Grassmannian point [21]. We use the projection kernel (see Eqn. 6) in this evaluation. 

UCSD traffic dataset (UCSD) [54] - The UCSD traffic dataset consists of 254 video sequences 
collected from the highway traffic over two days in Seattle (see Figure 3b for examples). It contains 
a variety of traffic patterns and weather conditions (i.e. overcast, raining, sunny). In total, there 
are 44 sequences of heavy traffic (slow, stop and go speeds), 45 sequences of medium traffic 
(reduced speed), and 165 sequences of light traffic (normal speed). To extract a Grassmannian 
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point, we first randomly select half the number of frames from each video. Each frame in each 
sequence is downsized to 140 x 161 pixels and further normalised by subtracting the mean frame 
and dividing the variance. Then, we apply the two dimensional Discrete Cosine Transform (DCT) 
on the frame and use the DCT coefficients as the feature vector for each frame. SVD is applied 
on the feature vectors of the frames to obtain the set of orthonormal bases. We also choose the 
projection kernel (see Eqn. 6) for this dataset. 

UCFIOI Human Actions dataset (UCFIOI) [55] - This dataset consists of 13,320 videos 
that belong to 101 categories. For example. Applying Eye Makeup, Blow Dry Hair and Mixing 
Batter (refer to Figure 3c). For each video, we hrst extract the normalised pixel intensities as 
features for all the frames. Then the SVD is applied on these features of each video to obtain the 
Grassmannian manifold point. Thus, in this dataset, there are 13, 320 manifold points in total. 
Projection kernel (see Eqn. 6) is used. 

Brodatz texture dataset (Brodatz) [56] - For the Brodatz dataset (refer to Figure 4a for 
examples) we follow the protocol presented in [59]. The protocol includes 3 subsets with different 
numbers of classes: 5-class-texture (5c, 5m, 5v, 5v2, 5v3), 10-class-texture (10, lOv) and 16-class- 
texture (16c, 16v). Each image is down-sampled to 256 x 256 pixels and divided into 64 32 x 32 pixel 
size regions. A feature vector F{x, y) for each pixel is calculated using the grayscale intensities 
and absolute values of the hrst- and second-order derivatives of spatial feature vectors. It can 

Each region is represented by a 
The Gaussian Kernel with 


be illustrated as: F{x,y) = J(a;, 2 /),||^' 

covariance matrix (SPD matrix) formed from these fea 
Log-Euclidean distance (see Eqn. 3) is used for this dataset. 

KTH-TIPS2b material dataset (KTH-TIPS2b) [57] - This dataset contains 11 material 
categories captured under 4 different illuminations, in 3 poses and at 9 scales (refer to Figure 4b). 
Thus, there are 3x4x9 = 108 images for each sample in one category, with 4 samples per 
material. We extract a 20-dimensional feature vector for each pixel in the image: 


92 / 


dx^ 


02/ 


aj/2 

ure vectors 


\I(x, v), Y(x, y), Ct{x, y), Cr(x, y), • • • F£)(V)I. 


( 22 ) 


where I{x,y) is the image grey level value at location {x,y)-, Y, Cb and Cr are the perceptually 
uniform CIELab colour space; The hlter banks F* consist of different of offset Gaussians applied 
on the luminance channel Y [60] . The covariance matrix is computed once the feature vectors are 
extracted from every pixel location. This becomes the image representation over a SPD manifold. 
For the manifold kernel in this dataset, we use Gaussian kernel with the Stein Divergence (see 
Eqn. 4) as this has been shown to be effective in various classihcation problem domains [20, 61]. 
HEp-2 Cell ICIP2013 dataset [58] - This dataset contains 13, 596 cell images that include 
six cell patterns namely Gentromere, Golgi, Homogeneous, Nucleolar, Nuclear Membrane, and 
Speckled (refer to Figure 4c). The cell boundary of every cell image is described by a mask image 


of the same size. For each cell image, we hrst extract the following 


that belongs to the cell content: F{x,y) = 


I dll 
I dx I 


<Hx,y) , 


d^i 


9x2 


eature vector of each pixel 
92 / 


9y2 


,arctan(|f|/ g 


rom these feature vectors extractec 


from 


Then, the covariance matrix (SPD matrix) is formed 
each image. We also use Gaussian kernel with the Stein Divergence (see Eqn. 4) for the evaluation 
on this dataset. 


15 

























Figure 3: Examples from (a) Ballet action dataset [53] (b) UCSD traffic dataset [54] and 
(c) UCFIOI dataset [55] 


4-2. Experimental Settings 

As illustrated in Figure 1, we first randomly project the points and then apply K-means. As 
such, for each dataset, we hrst run each proposed projection method 10 times to generate 10 
different random projection representations. Then, for each representation, we run the K-means 
algorithm 10 times, resulting in each method being repeated 100 times for each evaluation. The 
average of clustering performance and run time were reported. As the source of variation for the 
other approaches is predominantly on the initial cluster seeds of K-means, we only repeat the 
experiment 10 times to obtain the average clustering performance and run time. 

All of the approaches are tuned to give the best performance. We hnd the optimum size of set 
5 as follows: (1) Ballet: 100; (2) UCSD: 90; (3) UCFIOI: 101; (4) Brodatz: 100; (5) KTH-TIPS2b: 
48 and (6) HEp-2 Cell ICIP2013: 60. In addition, for KGRP, we set the number of dimensionality, 
b, to 300. 

To measure the clustering quality, there are two main types of metrics: internal metrics based 
on the distances between data points in the space, and external metrics based on the labels of 
the data [62] . The clustering task in our proposed framework is performed in a transformed space 
which may have different scale to other spaces used by comparable methods such as LogE (see 
below for further discussion on LogE). This may make the internal metrics such as Dunn Index 
unsuitable in our case. Thus, we choose four external metrics to measure the clustering quality: 
Rand Index (RI), Cluster Purity (CP), F-Measure and Normalized Mutual Information (NMI). 
Interested readers are referred to [62] for further explanation of each metric. In addition, we also 
measure the run time (in seconds) of each approach on every evaluation. The run time is measured 
from the kernel matrix computation until the completion of the clustering process. Finally, we 
report the average run time of the approaches. 

Our proposal is contrasted to six approaches: (1) Intrinsic K-means [5]; (2) Log-Euclidean K- 
means [13]; (3) Kernel K-means [2, 15]; (4) KPCA K-means [49, 15]; (5) Sigma set K-means [63] 
and (6) Grassmanian clustering [21]. The following is the brief description of each approach. 



(a) 



Figure 4: Examples from (a) BRODATZ texture dataset [56], (b) KTH-TIPS2b material 
dataset [57] and (c) HEp-2 Gell IGIP2013 dataset [58]. 
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Table 2: The clustering quality with variance (in %) measured by Rand Index (RI), Cluster 
Purity (CP), F-Measure and Normalized Mutual Information (NMI) on Ballet dataset. The best 
performance is in bold. We refer to Section 4.2 for further explanation of each approach. 


Methods/Measurements 

RI 

CP 

F-Measure 

NMI 

Intrinsic [5] 

73.68±0.00 

34.92±0.00 

33.81 ±0.00 

21.73 ±0.00 

LogE [13] 

78.23±0.15 

20.85±2.66 

14.81 ±0.37 

3.91 ±0.81 

G-clustering [21] 

76.41±0.07 

18.63±0.58 

16.39 ±0.25 

3.51 ±0.47 

Kernel K-means [2, 15] 

79.89T0.80 

40.86±3.06 

32.92 ±3.21 

32.00 ± 2.73 

KPCA [49, 15] 

78.62±2.14 

42.30±3.33 

36.27 ±2.68 

34.80 ± 3.48 

KGRP 

78.02±1.79 

41.89±2.43 

37.98 ± 2.79 

34.05 ± 2.41 

KORP 

78.28±1.68 

42.54T2.37 

38.68 ±2.81 

35.30 ±2.80 

KPCA-RP 

77.81±1.94 

41.90±2.31 

38.23 ±3.11 

34.64 ± 2.75 


Intrinsic K-means (Intrinsic): To cluster a set of manifold points, Intrinsic K-means works 
directly on the manifold space using the appropriate geodesic distance [5]. We note that as the 
intrinsic approach is generally very slow, we stop the Intrinsic K-means after 100 iterations. 
Log-Euclidean K-means (LogE): We hrst project all of the manifold points into the tangent 
space at the identity [18]. Once projected, each point will be vectorised into a column vector. As 
for SPD manifolds, we follow the work in [9] that uses only the upper triangular elements. This 
trick will reduce the hnal representation dimensionality, markedly reducing the run time on the 
subsequent process. Unfortunately the trick cannot be used on Grassmannian manifolds since the 
representation for a point on the Grassmannian manifold is not a symmetric matrix. In this case, 
all the elements are used in the hnal representation. This could adversely affect the overall run 
time when the manifold dimensionality is high. In the hnal step, K-means algorithm is applied. 
Log-Euclidean k-means has been used for clustering large amount of manifold data [13]. 

Kernel K-means: This approach embeds manifold points into RKHS. Then Kernel K-means is 
applied to perform clustering [2, 15]. 

KPCA K-means (KPCA): All manifold points are hrst embedded into RKHS. Then, KPGA is 
used for projecting the points in RKHS into the space spanned by the principal components [49, 15]. 
Finally, the K-means is applied. 

Sigma set K-means (SIS): Hong et ah [63] proposed a novel descriptor for SPD manifolds 
which simplihes the computations of distance and mean. Using their proposed descriptors, we 
apply K-means with novel efficient computations of mean and distance. 

Grassmanian clustering (G-clustering) Shirazi et ah [21] proposed a clustering method for 
Grassmanian manifolds which use the eigenvectors of the normalised projection kernel matrix as 
the new features of Grassmanian points. 

A 5*. Comparative Analysis on Clustering Quality 

Tables 2, 3, 4, 5, 6 and 7 report the average clustering quality of each individual approach ap¬ 
plied on each dataset. In general, our proposed methods perform reasonably well and show a close 
match to KPGA K-means and Kernel K-means. Also, the performance of the proposed approaches 
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Table 3: The clustering quality with variance (in %) measured by Rand Index (RI), Cluster 
Purity (CP), F-Measure and Normalized Mutual Information (NMI) on UCSD dataset. The best 
performance is in bold. We refer to Section 4.2 for further explanation of each approach. 


Methods/Measurements 

RI 

CP 

F-Measure 

NMI 

Intrinsic [5] 

73.26±0.00 

74.70±0.00 

75.15 ±0.00 

36.18 ±0.00 

LogE [13] 

55.24±3.25 

67.23±2.66 

40.39 ±2.81 

19.11 ±3.59 

G-clustering [21] 

50.68±0.11 

64.82±0.00 

34.31 ±0.12 

0.92 ±0.29 

Kernel K-means [2, 15] 

69.98±7.06 

77.96±4.77 

57.34 ± 10.22 

45.50 ±9.71 

KPCA [49, 15] 

77.90T5.97 

80.08±2.96 

69.29 ±7.56 

51.31 ±6.09 

KGRP 

75.61±3.48 

79.64±2.07 

66.97 ±5.17 

48.29 ± 3.80 

KORP 

77.25±1.25 

80.18T0.74 

68.99 ± 1.62 

50.58 ±1.67 

KPCA-RP 

76.46±2.79 

79.64±1.68 

68.60 ±3.50 

49.74 ± 3.02 


Table 4: The clustering quality with variance (in %) measured by Rand Index (RI), Cluster 
Purity (CP), F-Measure and Normalized Mutual Information (NMI) on UCFIOI dataset. The 
best performance is in bold. We refer to Section 4.2 for further explanation of each approach. 


Methods/Measurements 

RI 

CP 

F-Measure 

NMI 

Intrinsic [5] 

LogE [13] 

Kernel K-means [2, 15] 
KPCA [49, 15] 

97.53 ±0.00 

97.89 ±0.02 

97.71 ±0.06 

97.69 ±0.02 

12.94 ±0.00 

8.21 ±0.15 

15.97 ±0.48 

17.66 ±0.33 

7.43 ± 0.00 

3.62 ±0.06 

8.80 ±0.35 

9.47 ±0.19 

27.65 ± 0.00 

18.68 ±0.07 

32.35 ±0.31 

33.66 ±0.18 

KGRP 

KORP 

KPCA-RP 

97.90 ± 0.03 

97.90 ±0.02 

97.89 ±0.03 

15.38 ±0.28 

15.69 ±0.28 

15.66 ±0.33 

7.40 ±0.15 

7.62 ±0.17 

7.59 ±0.17 

30.96 ±0.21 

31.47 ±0.17 

31.38 ±0.23 


Table 5: The clustering quality with variance (in %) measured by Rand Index (RI), Cluster Purity 
(CP), F-Measure and Normalized Mutual Information (NMI) on BRODATZ dataset. The best 
performance is in bold. We refer to Section 4.2 for further explanation of each approach. 


Methods/Measurements 

RI 

CP 

F-Measure 

NMI 

Intrinsic [5] 

92.29±0.00 

79.05±0.00 

74.20 ± 0.00 

75.94 ± 0.00 

SIS [63] 

91.42±0.00 

76.99±0.00 

69.68 ±0.00 

72.84 ± 0.00 

LogE [13] 

92.04±0.78 

78.34±2.34 

74.10 ±2.10 

76.13 ± 1.45 

Kernel K-means [2, 15] 

93.15±0.95 

81.40±2.75 

75.62 ±2.13 

78.19 ±1.83 

KPCA [49, 15] 

93.89±0.22 

82.60±1.14 

76.64 ± 0.66 

79.44 ±0.57 

KGRP 

93.47±0.78 

82.22±2.34 

75.84 ± 1.82 

78.49 ± 1.49 

KORP 

93.66±0.77 

82.58±2.32 

76.30 ± 1.81 

79.11 ± 1.50 

KPCA-RP 

93.77±0.84 

82.81±2.49 

76.39 ± 1.93 

79.16 ±1.56 
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Table 6: The clustering quality with variance (in %) measured by Rand Index (RI), Cluster Purity 
(CP), F-Measure and Normalized Mutual Information (NMI) on KTH-TIPS2b dataset. The best 
performance is in bold. We refer to Section 4.2 for further explanation of each approach. 


Methods/Measurements 

RI 

CP 

F-Measure 

NMI 

Intrinsic [5] 

86.99±0.00 

49.45±0.00 

36.19 ±0.00 

44.20 ± 0.00 

SIS [63] 

80.81±0.00 

41.62±0.00 

44.45 ±0.00 

40.47 ±0.00 

LogE [13] 

85.94±0.60 

45.19±1.32 

33.48 ± 1.01 

40.69 ± 0.82 

Kernel K-means [2, 15] 

88.35±0.35 

52.59±1.37 

41.11 ± 1.01 

51.08 ±0.82 

KPCA [49, 15] 

88.48T0.40 

53.38T1.53 

41.22 ± 1.35 

50.97 ±0.90 

KGRP 

88.41±0.42 

53.15±1.34 

40.48 ± 0.99 

49.87 ± 1.01 

KORP 

88.36±0.39 

53.04±1.10 

40.61 ±0.93 

50.06 ±0.92 

KPCA-RP 

88.35±0.44 

53.45±1.35 

40.21 ±0.91 

49.97 ± 1.09 


Table 7: The clustering quality with variance (in %) measured by Rand Index (RI), Cluster Purity 
(CP), F-Measure and Normalized Mutual Information (NMI) on IIEp-2 Cell ICIP2013 dataset. 
The best performance is in bold. We refer to Section 4.2 for further explanation of each approach. 


Methods/Measurements 


RI 


CP F-Measure NMI 


Intrinsic [5] 

73.96 

± 

0.00 

44.02 

± 

0.00 

35.69 

± 

0.00 

22.65 

± 

0.00 

SIS [63] 

74.50 

± 

0.00 

39.50 

± 

0.00 

27.32 

± 

0.00 

18.01 

± 

0.00 

LogE [13] 

74.80 

± 

0.95 

46.00 

± 

2.37 

34.75 

± 

0.86 

23.64 

± 

1.29 

Kernel K-means [2, 15] 

73.96 

± 

2.14 

46.45 

± 

3.30 

37.29 

± 

3.29 

24.29 

± 

1.95 

KPCA [49, 15] 

75.74 

± 

2.87 

48.48 

± 

1.94 

34.23 

± 

2.20 

25.29 

± 

0.00 

KCRP 

75.72 

± 

0.31 

49.05 

± 

1.10 

34.83 

± 

0.84 

25.74 

± 

0.82 

KORP 

75.63 

± 

0.62 

48.70 

± 

2.34 

34.73 

± 

1.67 

25.49 

± 

1.72 

KPCA-RP 

75.72 

± 

0.41 

48.70 

± 

2.56 

34.48 

± 

1.74 

25.46 

± 

1.93 


is similar to each other. These factors suggest that the proposed projection approaches possess the 
JL-Type projection properties. Furthermore, we hnd that the proposed approaches in some cases 
have markedly better performance than the Kernel K-means. One of the possible reasons could 
be that the random projection reduces the eccentricity of original Gaussian-distributed clusters 
and make clusters in projected spaces more spherical [64]. 

Intrinsic K-means gives us reasonable results as it directly works on manifold space. Compared 
to the intrinsic approach, LogE has a worse performance in most of datasets. An exception is 
on the Ballet dataset where the intrinsic approach has a worse Rand Index than the LogE. We 
conjecture that this is caused by the failure of the intrinsic algorithm to converge in 100 iterations. 
Nevertheless, the other performance metrics such as CP, F-Measure and NMI for the intrinsic 
approach in this dataset still show reasonable performance. The worse performance for LogE is 
due to signihcant distortion of the pairwise distance produced when the points are projected into a 
tangent space. The G-clustering has a better Rand Index than the intrinsic approach in the Ballet 
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Figure 5: Clustering quality of the proposed KPCA-RP when the kernel parameter {3 was varied 
on the Ballet dataset. The clustering quality is measured by: Rand Index (RI), Cluster Purity 
(CP), F-Measure and Normalized Mutual Information (NMI). 


dataset, which is a similar conclusion drawn in the original work proposing the approach [21]. Note 
that the measurements for clustering performance are different from that in [21]. In most cases, 
the G-clustering is not robust as the performance of G-clustering measured by GP, F-measure and 
NMI is usually low. In addition, we do not report the G-clustering results for the UGFlOl dataset, 
as the K-means does not converge within a specihed amount of time. 

We found the performance of our proposed methods does not change signihcantly, when the 
parameters are varied. Figures 5 and 6 show two examples of the clustering results of KPGA-RP 
and KORP with different parameters on the Ballet and HEp-2 Gell IGIP2013 dataset, respectively. 
We note that the results on the other datasets also exhibit similar trends. This suggests that the 
issue raised in [23], where different parameters may adversely alter the kernel space, may not have 
signihcant effect on our work. We conjecture that this might be due to the selected manifold 
kernels crafted to capture the manifold intrinsic structure. However, if in the case where the 
parameter choice of the manifold kernel signihcantly contributes to the clustering results, one 
could use a randomly selected small subset of data to perform the parameter search. 

The evaluation has clearly shown that our proposal has similar performance to the kernel 
methods such as KPGA K-means and Kernel K-means. Indeed, these results alone do not give 
us much advantage over the other methods. However, we now present the main advantage of our 
proposal which is a direct consequence of applying random projection. 

3.4- Run Time Comparative Analysis 

Table 8 presents the average run time of the individual approach on each dataset. One of the 
striking observations from this table is that our proposed approaches have very fast run times. In 
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Figure 6: Clustering quality of the proposed KORP when the parameter (3 is varied on the HEp-2 
Cell ICIP2013 dataset. The clustering quality is measured by: Rand Index (RI), Cluster Purity 
(CP), F-Measure and Normalized Mutual Information (NMI). 


Table 8: The run time (in seconds) of the approaches on each dataset. Lower run time is better. 
As in each iteration of K-means, the run time is extremely similar, we report the average run time 
of each approach without variance. The datasets presented in the hrst three columns (i.e. Ballet, 
UCSD and UCFIOI) are modelled in Grassmannian manifolds, whilst the other three (i.e. Bro- 
datz, KTH-TIPS2b and HEp-2 Cell ICIP2013 (shorten as Cell)) are modelled in SPD manifolds. 
The last three rows are the proposed approaches. SIS and G-clustering are only applicable for 
SPD manifolds and Grassmannian manifolds, respectively. We refer to Section 4.2 for further 
explanation of each approach. 


Methods/Dataset 

Ballet 

UGSD 

UCFIOI 

Brodatz 

KTH-TIPS2b 

Cell 

Intrinsic [5] 

3966.49 

1990.02 

1.64 X 10 ® 

24.63 

938.95 

564.49 

SIS [63] 

N/A 

N/A 

N/A 

4.77 

60.43 

185.81 

LogE [13] 

3.35 

1.55 

9088.11 

0.15 

4.85 

2.32 

G-clnstering [21] 

2.81 

0.74 

N/A 

N/A 

N/A 

N/A 

Kernel K-means [2, 15] 

1.06 

0.70 

2019.55 

22.57 

675.75 

2172.87 

KPCA [49, 15] 

1.47 

0.73 

6.11 X 10 ^ 

22.42 

699.34 

2881.10 

KGRP 

0.51 

0.53 

238.64 

7.08 

14.61 

21.95 

KORP 

0.58 

0.49 

101.87 

7.03 

11.75 

17.73 

KPCA-RP 

0.60 

0.49 

102.79 

7.75 

12.28 

17.73 
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some cases (i.e. Ballet, UCSD and UCFIOI datasets) they outperform the LogE which is expected 
to be the fastest method. The bottleneck suffered by LogE in these datasets is from the high 
dimensionality of the feature vectors signihcantly slowing the K-means algorithm. Note that, 
although the run time of LogE on Brodatz, KTH-TIPS2b and HEp-2 Cell ICIP2013 dataset is 
quicker than our proposed methods, the clustering quality shown in Tables 5, 6 and 7 is much 
worse than that of ours. 

The proposed approaches are considerably faster than the kernel approaches such as KPCA 
K-means and Kernel K-means. This is because the proposed approaches only compute the kernel 
matrix on a small subset of data points. The beneht will become more pronounced for large 
datasets such as KTH-TIPS2b, UCFIOI and HEp-2 Cell ICIP2013 datasets where our proposed 
approach achieves 57.5 (i.e. ~ 57.5), 19.8 (i.e. ~ 19.8) and 122.5 times (i.e. ~ 

112.5) speed up, respectively. Thus, the proposed approaches will contribute signihcantly to the 
clustering of large amount of images or video data for practical applications. 

The speed up gained by the proposed approaches is attributed to the effect of applying ran¬ 
dom projection into a reduced projection space. The proposed approaches also have additional 
advantages over the kernel approaches as they do not need to compute the kernel matrix on the 
entire dataset. 

In addition, we analyse the computational complexity of each method in Table 9. In general, 
each method has two main steps: (1) Data pre-processing and (2) K-means steps. Data pre¬ 
processing may include kernel computation and/or projection. Whilst, K-means step comprises 
cluster membership and cluster mean computations. In Intrinsic K-means, the pre-processing step 
is not required. To calculate mean of each cluster, one need to use the intrinsic mean, denoted 
Karcher mean [10] that requires multiple iterations to converge. The intrinsic distance is also used 
for membership computation. For LogE, each manifold point needs to be projected onto the Log- 
Euclidean space. This projection is done once. Then, K-means is applied in the Log-Euclidean 
space. The computational complexity of KPCA and Kernel K-means follows quadratic and cubic 
growth, respectively. However, our proposed methods have linear growth, as the number of data 
points, n, is much bigger than the size of subset, p. This further corroborates the results presented 
in Table 8. 

4-5. Further Analysis 

In this section, we analyse the parameters contributing to the performance and run time of the 
proposed methods. Due to space limitations, we only show the performance measured by RI and 
CP. Note that the performance measured by F-Measure and NMI also follows the same trends. An 
obvious parameter is the projected space dimensionality, k. When k is small, each data point will 
be represented in a much smaller feature vector, resulting in faster K-means clustering processes. 
Another parameter is |5|, the size of set S which determines the run time of the kernel matrix 
computation. As |iS| gets larger, it takes longer to compute the kernel matrix. Smaller |5| gives 
more advantage to the proposed methods over the kernel approaches such as Kernel K-means and 
KPCA that require kernel computation on the entire data points. We note that k and |5| have an 
interesting relationship. More precisely, for KORP and KPCA-RP, |5| determines the projected 
space dimensionality, k. Therefore, it is desirable to make |iS| as small as possible whilst still 
preserving as much of the pairwise distance. 
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Table 9: Computational complexity of the approaches on each dataset. The dimensionality of 
SPD and Grassmannian points is d x d and q x d, respectively. For convenience, Q is used to 
represent Grassmannian manifold in this table. Note that: n is the number of points; m is the 
number of clusters; £ is the number of iterations of K-means; ikar is the number of iterations of 
Karcher mean; b is the dimensionality of the random projection space generated by KGRP and p is 
the dimensionality of the random projection space generated by KORP and KPCA-RP {p = |5|). 



Compute 

Kernel 

Compute 

Projection 

Compute 

Mean 

Compute 

Membership 

Overall 

Complexity 

Intrinsic(SPD) [5] 

N/A 

N/A 

0{Ukarnd?) 

0{£nmd^) 

0{££karnd?' + £mnd?') j 

Intrinsic (0) [5] 

N/A 

N/A 

0(UkaTn{q(P + d?)) 

0(inm(q<p + d?)) 

0({££kaT'n- + £nm){qd?‘ + d?)) 

SIS [63] 

N/A 

0{nd^) 

0{ind?) 

0{£nmd^) 

0(£nmd^) 

LogE(SPD) [13] 

N/A 

0{nd?') 

0[ind?) 

0{£nmd^) 

0{nd?' + £nmd?) 

LogE(e?) [13] 

N/A 

OinqcP) 

0{£nqd) 

0(£nmqd) 

0{nq(P + £nmqd) 

G-clustering [21] 

0{nh 

0{n^) 

0(£n^) 

0{£rdm) 

0{n^ + £rdm) 

Kernel K-means [2, 15] 

0{nd 

N/A 

N/A 

0{£rdm) 

0{£rdm) 

KPCA [49, 15] 

0("h 

o(uh 

0(£n^) 

0{£n^m) 

0(n^ + £n^m) j 

KGRP 

0(np) 

0(p^ + np^) 

0(£npb) 

0{£nmb) 

0{np + p^ + np^ + £nmb) 

KORP 

0(np) 

0{p^ + np^) 

0{£np) 

0{£nmp) 

0{np + p^ + np^ + £nmp) 

KPCA-RP 

0{np) 

0{p^ + np^) 

0{£np) 

0(£nTnp) 

0(np + p^ + np^ + £nmp) 
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Figure 7: The Rand Index (in %) of the proposed approaches when the size of set S is progressively 
increased on the KTH-TIPS2b dataset. KGRP: Kernelised Gaussian Random Projection; KORP: 
Kernelised Orthonormal Random Projection; KPCA-RP: Kernel PGA based Random Projection. 


In contrast to KORP and KPCA-RP, KGRP separates the projected space dimensionality to 
|5|. Nevertheless, we found that |5| still plays an important role in the overall system perfor¬ 
mance. To verify this, we vary |iS| on the KTH-TIPS2b. As we can see from Figures 7 and 8, 
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0.1 0.2 0.4 0.6 1.0 2.0 5.0 10.0 20.0 40.0 80.0 100.0 

Size of S in proportion to the entire KTH-TIPS2b dataset (in %) 

-•-KORP -•-KGRP ^KPCA-RP 

Figure 8: The Cluster Purity (in %) of the proposed approaches when the size of set S is progres¬ 
sively increased on the KTH-TIPS2b dataset. KGRP: Kernelised Gaussian Random Projection; 
KORP; Kernelised Orthonormal Random Projection; KPCA-RP: Kernel PGA based Random 
Projection. 

the performance of the proposed approaches increases as |5| is progressively increased. The per¬ 
formance increase stops when |5| reaches a particular value. In this analysis we also found that 
the performance of KORP and KPGA-RP is markedly better than KGRP when |5| is consider¬ 
ably small. A possible reason is that the GLT requires the set S to have a minimum number of 
elements (normally 30) in order to make the theorem applicable. 

The above observation suggests the following facts about |5|; (1) |iS| determines the run time 
for all the proposed approaches (i.e. on the kernel computation); (2) |5| also contributes to the 
K-means run time for KORP and KPGA-RP; (3) the lower bound of |5| in the KGRP is related 
to the lower bound of the GLT and (4) the lower bound of |iS| for KORP and KPGA-RP is related 
to the lower bound of k. 

The JL-Lemma relates k to the total number of data points, n (refer to Lemma 3.1). This 
relationship seems unfavourable for KORP and KPGA-RP as this would mean |5| increases as 
n increases. Fortunately, Lemma 3.3 and Theorem 3.5 suggest that k is related to the margin 
between classes. This means that we now need only consider the separating margin to select |iS|. 
To further corroborate this empirically, we apply the proposed approaches by varying the dataset 
size of the KTH-TIPS2b. We assume that the margin is relatively unchanged though the dataset 
size is varied. More precisely, we hrst £x |5| for each proposed approach. Then we randomly select 
the data points from the KTH-TIPS2b to create a smaller version of the dataset. The proposed 
approaches are applied on these smaller subsets of the dataset. Note that although |5| is hxed, 
we still select the elements of S from the given subset. The results shown in Figures 9 and 10 
suggest that the proposed approaches still have on par performance with both Kernel K-means 
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Figure 9: The Rand Index (in %) of the proposed approaches, Kernel K-Means and KPCA applied 
on subsets of KTH-TIPS2b with various sizes. We £x liSI for all subsets. 
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Figure 10: The Cluster Purity (in %) of the proposed approaches, Kernel K-Means and KPCA 
applied on subsets of KTH-TIPS2b with various sizes. We £x |5| for all subsets. 

and KPCA K-means, suggesting that |iS| relates to the margin separation between classes. 

5. Conclusions and Future Directions 

Clustering over Riemannian manifolds plays an important role in the automatic analysis of 
images and videos [5, 21]. As discussed before, in general, the existing methods suffer from either 
poor performance or high computational complexity. In this paper, we propose a novel framework 
with random projection to tackle the clustering problems over Riemannian manifolds. Based on the 
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framework, we present three random projection methods for manifold points: KGRP, KORP and 
KPCA-RP. Through experiments on several computer vision applications, we demonstrate that our 
proposed framework achieves signihcant speed increases while maintaining clustering performance 
in comparison to the other conventional methods such Kernel K-means. Furthermore, we analyse 
the parameters that impact the performance and run time of our proposed methods. 

In the proposed framework, we carry out random projection for manifold points with the aid of 
RKHS. In other words, we hrst project manifold points into RKHS. One promising future direction 
is to study the intrinsic random projection, which directly maps manifold points into the random 
projection space. To this end, one needs to dehne the notions of projection and hyperplane gen¬ 
eration process in the manifold space. 
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